"The first million is hardest to get": Building a Large Tagged Corpus as Automatically as Possible

نویسنده

Gunnel Källgren

چکیده

The paper describes a recently started project in Sweden. The goal of the project is to produce a corpus of (at least) one million words of running text from different genres, where all words are classified for word class and for a set of morphosyntactic properties. A set of methods and tools for automating the process are being developed and will be presented, and problems and some solutions in connection with e.g. homography disambiguation will be discussed.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Corpus based coreference resolution for Farsi text

"Coreference resolution" or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreference when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...

متن کامل

How textbooks (and learners) get it wrong: A corpus study of modal auxiliary verbs

Many elements contribute to the relative difficulty in acquiring specific aspects of English as a foreign language (Goldschneider & DeKeyser, 2001). Modal auxiliary verbs (e.g. could, might), are examples of a structure that is difficult for many learners. Not only are they particularly complex semantically, but especially in the Malaysian context ...

متن کامل

PAYMA: A Tagged Corpus of Persian Named Entities

The goal in the named entity recognition task is to classify proper nouns of a piece of text into classes such as person, location, and organization. Named entity recognition is an important preprocessing step in many natural language processing tasks such as question-answering and summarization. Although many research studies have been conducted in this area in English and the state-of-the-art...

متن کامل

پیکره اعلام: یک پیکره استاندارد واحدهای اسمی برای زبان فارسی

Named entity recognition (NER) is a natural language processing (NLP) problem that is mainly used for text summarization, data mining, data retrieval, question and answering, machine translation, and document classification systems. A NER system is tasked with determining the border of each named entity, recognizing its type and classifying it into predefined categories. The categories of named...

متن کامل

UCSG: A Wide Coverage Shallow Parsing System

In this paper, we propose an architecture, called UCSG Shallow Parsing Architecture, for building wide coverage shallow parsers by using a judicious combination of linguistic and statistical techniques without need for large amount of parsed training corpus to start with. We only need a large POS tagged corpus. A parsed corpus can be developed using the architecture with minimal manual effort, ...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 1990

"The first million is hardest to get": Building a Large Tagged Corpus as Automatically as Possible

نویسنده

چکیده

منابع مشابه

Corpus based coreference resolution for Farsi text

How textbooks (and learners) get it wrong: A corpus study of modal auxiliary verbs

PAYMA: A Tagged Corpus of Persian Named Entities

پیکره اعلام: یک پیکره استاندارد واحدهای اسمی برای زبان فارسی

UCSG: A Wide Coverage Shallow Parsing System

عنوان ژورنال:

اشتراک گذاری